Scalable Percolation Search in Power Law Networks

We introduce a scalable searching algorithm for finding nodes and contents in random networks
with Power-Law (PL) and heavy-tailed degree distributions. The network is searched using a
probabilistic broadcast algorithm, where a query message is relayed on each edge with probability
just above the bond percolation threshold of the network. We show that if each node caches its
directory via a short random walk, then the total number of accessible contents exhibits a
first-order phase transition, ensuring very high hit rates just above the percolation threshold.
In any random PL network of size $N$ and exponent $2 < \tau < 3$, the total traffic per query
scales sub-linearly, while the search time scales as $O(\log N)$. In a PL network with exponent
$\tau \approx 2$, any content or node can be located in the network with probability approaching
one in time $O(\log N)$, while generating traffic that scales as $O(\log^2 N)$ if the maximum
degree, $k_{\max}$, is unconstrained, and as $O(N^{1/2+\epsilon})$ (for any $\epsilon > 0$) if
$k_{\max} = O(\sqrt{N})$. Extensive large-scale simulations show these scaling laws to be
precise. We discuss how this percolation search algorithm can be directly adapted to solve the
well-known scaling problem in unstructured Peer-to-Peer (P2P) networks. Simulations of the
protocol on sample large-scale subnetworks of existing P2P services show that overall traffic can
be reduced by almost two orders of magnitude, without any significant loss in search performance.

PACS numbers:

I. INTRODUCTION AND MOTIVATION

Scale-free networks with heavy-tailed and Power-Law 
(PL) degree distributions have been observed in several 
different fields and scenarios (see, e.g., and references 
therein). In a PL degree distribution, the probability
that a randomly chosen node has degree $k$ is given by
$P(k) \sim k^{-\tau}$, where $\tau$ is referred to as the exponent of
the distribution. For $2 < \tau < 3$ a network with $N$ nodes
has constant or at most $O(\log N)$ average degree, but
the variance of the degree distribution is unbounded. It
is in this regime of $\tau$ that PL networks display many
of their advantageous properties, such as small diameter,
tolerance to random node deletions [3], and a natural
hierarchy, where there are sufficiently many nodes of high
degree.

The searching problem in random power-law networks
can be stated as follows [3]: Starting from a randomly
selected node, the source, find another randomly selected
node, the destination, through only local communications.
Equivalently, this can be cast as a messaging problem,
where it is desirable to transfer a message from an
arbitrary node to another randomly chosen node
through local (i.e., first neighbor) communications. Since
a searcher has no idea about the location of the destination
node in the network (unless each node somehow has
path information for all other nodes cached in it), the
problem is indeed that of transferring a message from a
node to all other nodes in the network.

Another equivalent version of this problem appears
in unstructured P2P networks, such as Gnutella,
Limewire [16], Kazaa, Morpheus, and Imesh,
where the data objects do not have global unique ids, and
queries are done via a set of key words. The reasons that
search in PL networks is important for such unstructured
P2P networks include: (i) A number of recent studies
have shown that the structure of these existing networks
has complex network characteristics [15, 17], including
approximate power law degree distributions. Thus PL
networks, or at least networks with heavy-tailed degree
distributions, seem to naturally emerge in the existing
services. (ii) Systematic P2P protocols that will lead to
the emergence of PL networks with tunable exponents,
even when nodes are deleted randomly, have been
proposed recently [18]. This makes it possible to
systematically design robust and random P2P networks that
admit PL degree distributions, and that can exploit several
properties of PL graphs that are extremely useful
for networking services, e.g., low diameter, which allows
fast searches; a randomized hierarchy, which allows optimal
usage of heterogeneous computing and networking
resources without the intervention of a global manager;
and extreme tolerance to random deletions of nodes,
which provides robustness.

In a straightforward parallel search approach in P2P
networks, each query is given a unique id, and each
node, on receiving the query message, sends it out to all
of its neighbors, unless the node has already processed
the query, which the node can identify by checking the
ids of the queries it has already processed. This leads to
$O(N)$ total messages in the network for every single query,
and results in significant scaling problems. For example,
Ripeanu et al. [15] estimated that in December of 2000
Gnutella traffic accounted for 1.7% of Internet backbone
traffic.
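The flooding scheme described above can be sketched in a few lines. This is a minimal illustration, not the Gnutella protocol itself: the adjacency-dictionary representation, the function name `flood`, and the six-node ring are assumptions made for the example. The set `seen` plays the role of the per-node list of already processed query ids.

```python
from collections import deque

def flood(adj, source):
    """Flood one query from `source`; return (nodes reached, messages sent)."""
    seen = {source}            # nodes that have already processed this query id
    queue = deque([source])
    messages = 0
    while queue:
        u = queue.popleft()
        for v in adj[u]:       # a node relays the query on every one of its edges
            messages += 1
            if v not in seen:  # duplicates are dropped by the receiver
                seen.add(v)
                queue.append(v)
    return len(seen), messages

# Hypothetical toy network: a ring of six nodes.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
reached, messages = flood(ring, 0)
```

Every reached node relays on all of its edges, so the message count equals the sum of the degrees of the reached nodes, which is why the cost per query grows at least linearly with the network size.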

As reviewed later in Section II, a number of ad hoc
measures, ranging from forcing an ultra-peer structure
on the network to a random-walk based approach, where
it is assumed that a constant fraction of the nodes in the
network caches each content's location, have been
proposed in the P2P community. But none of these
measures provides a provably scalable and decentralized
solution, where any content, even if it is located in only
one node, is guaranteed to be found. The only systematic
work on searches in random PL networks reported
so far [4] employs a serial search technique based on
random walks and caching of content-lists of every node on
all its neighbors (or on all its first and second neighbors),
and is reviewed in greater detail in Section II.

We present a parallel, but scalable, search algorithm
that exploits the structure of PL networks judiciously,
and provides precise scaling laws that can be verified via
extensive large-scale simulations (Section III). The key
steps in our search algorithm are: (i) Caching or Content
Implantation: Each node executes a short random
walk and caches its content list or directory on the visited
nodes. For example, for $\tau \approx 2$, this one-time-only
random walk is of length $O(\log N)$, and thus the average
cache size per node is $O(\log N)$. (ii) Query Implantation:
When a node wants to make a query, it first executes
a short random walk and implants its query request
on the nodes visited. (iii) Bond Percolation: All
the implanted query requests are propagated independently
through the network in parallel using a probabilistic
broadcast scheme. In this scheme, a node, on receiving
a query message for the first time, relays the message on
each of its edges with a certain probability $q$, which is
vanishingly greater than the percolation threshold, $q_c$, of
the underlying PL network (see [19] for a review of bond
percolation in PL graphs).

The physics of how and why the percolation search
algorithm works efficiently can be described as follows. The
bond percolation step, executed just above the percolation
threshold, guarantees that a query message is received
by all nodes in a giant connected component of
diameter $O(\log N)$ and consisting of high-degree nodes.
The content and query implantation steps ensure that
the content list of every node is cached on at least one of
the nodes in this giant component with probability
approaching one, and that one of the nodes in the giant
connected component receives a query implantation with
probability approaching one. Thus with $O(\langle k \rangle N q_c)$
traffic (which scales sublinearly for PL graphs, as shown in
Section III and [19]), any content (even if it is owned by
a single node in the network) can be located with
probability approaching one in time $O(\log N)$.

An interesting outcome pertaining to the physics of
networking is that the accessible contents/nodes exhibit
a first-order phase transition as a function of the broadcast
or percolation probability $q$, showing a sharp rise as
soon as $q$ exceeds the percolation threshold $q_c$. In contrast
to the accessible contents, the number of nodes and
edges in the giant connected component exhibits only a
second-order phase transition. One of the primary appeals
of the percolation search algorithm is that by combining
serial random walks (i.e., content and query implantations)
with bond percolation it engineers a second-order
phase transition into a first-order one, allowing query hits
approaching 100%, even when $\lim_{N \to \infty} (q - q_c) = 0$.

While the proof that the percolation search algorithm
leads to scalable traffic and low latency is based on fairly
involved concepts, the algorithm itself can be easily
implemented and directly adapted to solve the scaling problem
plaguing unstructured P2P networks. In Section V
we discuss such applications, and present simulation results
which show that even on sample large-scale subnetworks
of existing P2P services, the overall traffic can
be reduced by almost two orders of magnitude, without
any significant loss in search performance, by a direct
implementation of percolation search. We also consider
heterogeneous networks in Section VI, where the degree
distribution is a mixture of heavy-tailed and light-tailed
PL distributions. Such mixture distributions can model
networks, such as the popular P2P services, where nodes
belong to only a few types, and each type has its own
capability and hence its own degree distribution. We provide
both simulation and analytical studies of the improvements
to be accrued from the percolation search
algorithms when implemented on random heterogeneous
networks (Section VI).



II. PRIOR WORK AND COMPARISON 

The search algorithm by Adamic et al. [4] can be
described as follows: To convey a message from node A
to B, A sends a message that goes on a random walk
through the network. When arriving at a new node, the
message requests it to scan all its neighbors for the
destination node B. If B is not found among the neighbors
of the current node, then the message is sent to one of
the neighbors of the current node picked randomly.

This algorithm exploits the skewed degree distribution 
of the nodes in PL networks: The random walk natu- 
rally gravitates towards nodes with higher degree, and 
therefore, by scanning the neighbors of these high degree 
nodes, the random walker is expected to soon be able to 
scan a large fraction of the network. One could also scan 
both the first and the second neighbors of a node visited 
through the random walk (rather than just scanning its 
first neighbors), in order to find the destination node B. 
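The first-neighbor variant of this random-walk search can be sketched as follows. This is an illustrative reading of the description above, not code from the cited work; the adjacency-dictionary representation, the `max_hops` cutoff, and the star-shaped toy network are assumptions.

```python
import random

def random_walk_search(adj, source, target, max_hops, rng):
    """Walk randomly from `source`; at each node scan its first neighbors for `target`.

    Returns the number of hops taken when the target is found, or None if the
    hop budget `max_hops` (an assumed cutoff) is exhausted first.
    """
    node = source
    for hops in range(max_hops + 1):
        if node == target or target in adj[node]:
            return hops
        node = rng.choice(adj[node])   # forward to a uniformly random neighbor
    return None

# Hypothetical toy network: hub 0 connected to leaves 1..5.
star = {0: [1, 2, 3, 4, 5], **{i: [0] for i in range(1, 6)}}
hops = random_walk_search(star, source=1, target=2, max_hops=10, rng=random.Random(0))
```

In the star example the walker must step from the leaf to the hub, whose neighbor scan then finds the target, which illustrates why reaching high degree nodes quickly is what makes the strategy effective.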

Estimates for both search time and the number of messages
created (i.e., traffic) per query can be obtained
as follows: For a power-law random graph with exponent
$\tau$, the expected degree of a node arrived at via
a random link is $z_A \propto k_{\max}^{3-\tau} \propto N^{3/\tau - 1}$, assuming that
$k_{\max} = N^{1/\tau}$. Also, the expected number of the second
neighbors of a node randomly arrived at by following a
link is around $z_B \propto N^{2(3/\tau - 1)}$. Therefore, assuming that
nodes are not scanned multiple times during the random
walk, the whole network is expected to be scanned after
around

$$A_A \sim N / z_A = N^{2 - 3/\tau} \qquad (1)$$

hops if only the first neighbors are scanned, and

$$A_B \sim N / z_B = N^{3 - 6/\tau} \qquad (2)$$

if the second neighbors are scanned as well. For $\tau = 2$,
and the case where both first and second neighbors
of a node are scanned, the predicted scaling is
poly-logarithmic in the size of the network.
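As a quick arithmetic check of these estimates, the snippet below evaluates the predicted exponents of $N$, namely $2 - 3/\tau$ for first-neighbor scanning and $3 - 6/\tau$ for first-and-second-neighbor scanning, under the stated assumption $k_{\max} = N^{1/\tau}$. The function name is hypothetical.

```python
def predicted_exponents(tau):
    """Exponents of N in Eqs. (1) and (2), assuming k_max = N**(1/tau)."""
    first = 2 - 3 / tau     # A_A ~ N**(2 - 3/tau), first neighbors scanned
    second = 3 - 6 / tau    # A_B ~ N**(3 - 6/tau), second neighbors scanned too
    return first, second

for tau in (2.0, 2.1, 2.3):
    a, b = predicted_exponents(tau)
    print(f"tau={tau}: first-neighbor N^{a:.2f}, second-neighbor N^{b:.2f}")
```

At $\tau = 2$ the second exponent vanishes, which is the poly-logarithmic case mentioned above; for any $\tau > 2$ both predicted costs are polynomial in $N$.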

While this technique is an important first step towards 
exploiting the hierarchical structure of PL networks and 
provides a sublinear scaling of traffic, there are several 
drawbacks that need to be addressed: 

(i) The actual performance of the algorithm is far
worse than the theoretically predicted scaling laws.
The primary reason for this discrepancy is that the
estimates in Eqs. (1) and (2) are based on the
assumption that the nodes scanned during a walk are
unique, i.e., no node is scanned more than once.
As pointed out by the authors in [4], while this is
a good approximation at the start of the walk, it
quickly becomes invalid when a good fraction of
the nodes have been scanned. Extensive simulations
in [4] show that actual scaling is significantly
worse than the predicted values: For example, for
$\tau = 2.1$, Eq. (1) predicts a scaling of $N^{0.57}$ and
Eq. (2) predicts a scaling of $N^{0.14}$, but in both
cases the exponent observed in simulations is
substantially larger.

(ii) The random-walk search is serial in operation, and
even assuming that the predicted scalings are accurate,
the search time for finding any node or its
content in the network is polynomially long in $N$.
As an example, for $\tau = 2.3$, a value observed in
early Gnutella networks, the predicted search time
scalings are $A_A = N^{0.70}$ or $A_B = N^{0.39}$.
However, as mentioned before, the actual scalings are
going to be significantly worse than these predictions.

(iii) In order to obtain the best traffic scalings, one
needs to scale the cache (storage) size per node
polynomially; e.g., for $\tau \approx 2$, the cache size per
node should increase as $O(\sqrt{N})$. Recall that the
search strategy requires every node to answer if
the node/content satisfying the query message is
in any of its first neighbors or in any of its first and
second neighbors. This scanning can be performed
in two ways: (1) Without caching: For each query
message, the node queries all its first (or first and
second) neighbors. This strategy is then at least as
bad as flooding, since for each independent search,
all the links have to be queried at least once, which
results in a traffic per search of at least $O(N)$. (2)
With caching: Each node caches its content-list on
all its neighbors, or on all of its first and second
neighbors, as required by the protocol. Through
the random walk, the walker can scan the contents
of the neighbors (or both first and second neighbors)
by observing the content lists in the current
node without having to query the neighbors. The
total cache size required per node in the case of the
first-neighbor-only caching scheme is exactly the
average degree of nodes (i.e., $O(\log N)$ for $\tau = 2$),
and $N^{3/\tau - 1}$ (i.e., $O(\sqrt{N})$ for $\tau = 2$) when scanning
of both the first and second neighbors is
required. Thus the least traffic, and equivalently the
shortest search times, are obtained at the expense
of an increased cache size requirement per node.

As noted in the introduction and elaborated in the
later sections, we build on the basic ideas in [4], and
exploit the hierarchical structure of PL networks more
efficiently to successfully resolve many of the
above-mentioned issues. In particular, our results have the
following distinctive features: (1) The actual performance
of the algorithm matches the theoretical predictions. (2)
The algorithm takes $O(\log N)$ time and is parallel in
nature. (3) The average cache size increases with the
exponent $\tau$, and is minimum for $\tau = 2$, when the traffic
scaling is the most favorable. For example, for a random
PL network with exponent $\tau = 2$ and maximum degree
$k_{\max}$, we show that any content in the network can be
found with probability approaching one in time $O(\log N)$,
while generating only $O\left(N \times \frac{\log^2 k_{\max}}{k_{\max}}\right)$ traffic per query.
Moreover, the content and query implantation random
walks are $O(\log N)$ in size, leading to an average cache
size of $O(\log N)$. Thus, if $k_{\max} = cN$ (as is the case
for a randomly generated PL network with no a priori
upper bound on $k_{\max}$) then the overall traffic scales as
$O(\log^2 N)$ per query, and if $k_{\max} = \sqrt{N}$ (as is the usual
practice in the literature) then the overall traffic scales
as $O(\sqrt{N} \log^2 N) = O(N^{1/2+\epsilon})$ (for any $\epsilon > 0$) per query.



III. THE PERCOLATION SEARCH 
ALGORITHM AND ITS SCALING PROPERTIES 

The percolation search algorithm can be described as
follows:

(i) Content List Implantation: Each node in a network of
size $N$ duplicates its content list (or directory) through
a random walk of size $L(N, \tau)$ starting from itself. The
exact form of $L(N, \tau)$ depends on the topology of the
network (i.e., $\tau$ for PL networks), and is in general a
sub-linear function of $N$. Thus the total amount of directory
storage space required in the network is $N L(N, \tau)$, and
the average cache size is $L(N, \tau)$. Note that, borrowing
a terminology from the Gnutella protocol, the length of
these implantation random walks will also be referred to
as the TTL (Time To Live).

(ii) Query Implantation: To start a query, a query request
is implanted through a random walk of size $L(N, \tau)$
starting from the requester.

(iii) Bond Percolation: When the search begins, each
node with a query implantation starts a probabilistic
broadcast search, where it sends a query to each of its
neighbors with probability $q$, with $q = q_c/\gamma$, where $q_c$ is
the percolation threshold [19].
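The three steps can be put together in a short sketch. This is a minimal illustration under stated assumptions rather than a reference implementation: the adjacency-dictionary graph, the `owner_of` mapping from nodes to content sets, the function names, and the star-shaped toy network are all hypothetical, and the relay probability `q` is passed in directly instead of being computed from the percolation threshold.

```python
import random
from collections import deque

def random_walk(adj, start, length, rng):
    """Return the nodes visited by a walk of `length` hops (start included)."""
    path, node = [start], start
    for _ in range(length):
        node = rng.choice(adj[node])
        path.append(node)
    return path

def implant_contents(adj, owner_of, ttl, rng):
    """Step (i): each owner caches its content list on the nodes its walk visits."""
    cache = {v: set() for v in adj}
    for owner, items in owner_of.items():
        for v in random_walk(adj, owner, ttl, rng):
            cache[v].update(items)
    return cache

def percolation_search(adj, cache, requester, item, ttl, q, rng):
    """Steps (ii) and (iii): implant the query, then relay on each edge with prob. q."""
    seeds = set(random_walk(adj, requester, ttl, rng))   # query implantation
    seen, queue, messages = set(seeds), deque(seeds), 0
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if rng.random() < q:        # probabilistic broadcast on this edge
                messages += 1
                if v not in seen:       # a node relays only on first receipt
                    seen.add(v)
                    queue.append(v)
    hit = any(item in cache[v] for v in seen)
    return hit, messages

# Hypothetical toy network: hub 0 connected to leaves 1..5; node 3 owns "song".
star = {0: [1, 2, 3, 4, 5], **{i: [0] for i in range(1, 6)}}
rng = random.Random(0)
cache = implant_contents(star, {3: {"song"}}, ttl=1, rng=rng)
hit, messages = percolation_search(star, cache, 5, "song", ttl=1, q=1.0, rng=rng)
```

With `q = 1.0` the broadcast degenerates into flooding; the point of the algorithm is that, once contents and queries have been implanted on high degree nodes, a value of `q` just above the percolation threshold is expected to suffice.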

We next derive scaling and performance measures of
the above algorithm. Our derivations will follow these
steps:

• First, we define high degree nodes and compute the
number of high degree nodes in a given network.

• Second, we show that after the probabilistic broadcast
step (i.e., after performing a bond percolation
in the query routing step), a query is received by
all members of the connected component to which an
implant of that query belongs. We also see that the
diameter of all connected components is $O(\log N)$,
and thus the query propagates through it quickly.

• Third, we show that a random walk of length
$L(N, \tau)$ starting from any node will pass through
a highly connected node with probability approaching
one. This will ensure that (i) a pointer to any
content is owned by at least one highly connected
node, and (ii) at least one implant of any query is
at one of the high degree nodes.

• Finally, we examine the scaling of the maximum
degree of the network $k_{\max}$ and give the scaling of
query costs and cache sizes in terms of the size of
the entire network $N$. We show that both cache size
and query cost scale sublinearly for all $2 < \tau < 3$,
and indeed can be made to scale as $O(\log^2 N)$ with
the proper choice of $\tau$ and $k_{\max}$.



A. High Degree Nodes 

In this section we define the notion of a high degree
node. For any node with degree $k$, we say it is a high
degree node if $k > k_{\max}/2$. We assume that we deal with
random power-law graphs which have a degree distribution:

$$p_k = A k^{-\tau}, \quad \text{where } A^{-1} = \sum_{k=2}^{k_{\max}} k^{-\tau} \approx \zeta(\tau) - 1,$$

and $\zeta(\cdot)$ is the Riemann zeta function. $A$ approaches the
approximate value quickly as $k_{\max}$ gets large, and thus
can be considered constant. Thus the number of high
degree nodes, $H$, is given by:

$$H = N A \sum_{k=k_{\max}/2}^{k_{\max}} k^{-\tau}.$$

Since for all decreasing, positive $f(k)$ we have
$\sum_{k=a}^{b} f(k) > \int_{a}^{b+1} f(k)\,dk$ and
$\sum_{k=a}^{b} f(k) < \int_{a-1}^{b} f(k)\,dk$, we can bound $H$ from
above and below, and thus:

$$H = O\left(\frac{N}{k_{\max}^{\tau-1}}\right).$$

We have shown that $H = O(N / k_{\max}^{\tau-1})$. As we discuss
in Section III E, there are two choices for the scaling of
$k_{\max}$. If we put no prior limit on $k_{\max}$, it will scale
like $O(N^{1/(\tau-1)})$. As we will discuss, we may also
consider $k_{\max} = O(N^{1/\tau})$. We should note that the first
scaling law gives $H = O(1)$, or a constant number of
high degree nodes as the system scales. The second gives
$H = O(N^{1/\tau})$. For all $\tau > 2$, we have $H$ scaling
sublinearly in $N$.
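For reference, the two integral bounds can be evaluated explicitly. The following is a reconstruction from the stated inequalities with $f(k) = k^{-\tau}$, $a = k_{\max}/2$ and $b = k_{\max}$, not a quotation of the original intermediate steps:

```latex
\begin{align*}
H &= N A \sum_{k=k_{\max}/2}^{k_{\max}} k^{-\tau}
   < N A \int_{k_{\max}/2 - 1}^{k_{\max}} k^{-\tau}\,dk
   = \frac{N A}{\tau - 1}
     \left[ \left(\tfrac{k_{\max}}{2} - 1\right)^{1-\tau} - k_{\max}^{1-\tau} \right], \\
H &> N A \int_{k_{\max}/2}^{k_{\max}+1} k^{-\tau}\,dk
   = \frac{N A}{\tau - 1}
     \left[ \left(\tfrac{k_{\max}}{2}\right)^{1-\tau} - \left(k_{\max} + 1\right)^{1-\tau} \right].
\end{align*}
```

Both bounds are of order $N k_{\max}^{1-\tau}$ for large $k_{\max}$. Substituting $k_{\max} = N^{1/(\tau-1)}$ gives a number of high degree nodes of order one, while $k_{\max} = N^{1/\tau}$ gives order $N \cdot N^{-(\tau-1)/\tau} = N^{1/\tau}$, in agreement with the two cases quoted above.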

In the next sections we will show that without explic- 
itly identifying or arranging the high degree nodes in the 
network, we can still access them and make use of their 
resources to make the network efficiently searchable. 



B. High Degree Nodes are in the Giant Component 

In conventional percolation studies, one is guaranteed
that as long as $q - q_c = \epsilon > 0$, where $\epsilon$ is a constant
independent of the size of the network, then there will
be a giant connected component in the percolated graph.
However, in our case, i.e., PL networks with $2 < \tau < 3$,
$\lim_{N \to \infty} q_c = 0$ (for example, $q_c = \log(k_{\max})/k_{\max}$ for a PL
network with exponent $\tau = 2$), and since the traffic
(i.e., the number of edges traversed) scales as $O(\langle k \rangle N q)$,
we cannot afford to have a constant $\epsilon > 0$ such that
$q = \epsilon + q_c$: the traffic would then scale linearly.

Hence, we will percolate not at a constant above the
threshold, but at a multiple above the threshold:
$q = q_c/\gamma$. We consider this problem in detail in a separate
work [19]. The result is that if we follow a random edge in
the graph, the probability that it reaches an infinite component
is $\delta = z/k_{\max}$ for a constant $z$ which depends only on $\tau$
and $\gamma$, but not on $k_{\max}$.

Thus, since each high degree node has degree at least
$k_{\max}/2$, the average number of edges of a high degree node
that connect to the infinite component ($k_{\mathrm{inf}}$) is at least:

$$k_{\mathrm{inf}} \ge \delta\,\frac{k_{\max}}{2} = \frac{z}{2}.$$

The probability that a high degree node has at least one
link to the infinite component is at least:

$$P \ge 1 - (1 - \delta)^{k_{\max}/2} = 1 - \left(1 - \frac{z}{k_{\max}}\right)^{k_{\max}/2} \ge 1 - e^{-z/2}.$$

Thus both the average number of edges that a high
degree node has to the giant component, and the probability
that a high degree node has at least one edge to
the giant component, are independent of $k_{\max}$. So as we
scale up $k_{\max}$, we can expect that the high degree nodes
stay connected to the giant component. We can make $z$
larger by decreasing $\gamma$; in particular, if $1/\gamma > 2/(3 - \tau)$
we have $z > 1$ [19].
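A small numerical check illustrates why this bound does not degrade as the maximum degree grows. The snippet evaluates $1 - (1 - z/k_{\max})^{k_{\max}/2}$ and compares it with the limit $1 - e^{-z/2}$; the value $z = 1$ and the function name are assumptions chosen only for illustration.

```python
import math

def link_probability_bound(z, k_max):
    """Lower bound 1 - (1 - z/k_max)**(k_max/2) on the probability that a
    high degree node has at least one edge into the giant component."""
    return 1 - (1 - z / k_max) ** (k_max / 2)

z = 1.0                          # assumed value; the text only guarantees z > 1 is reachable
limit = 1 - math.exp(-z / 2)     # value approached as k_max grows
for k_max in (10, 100, 10_000):
    print(k_max, link_probability_bound(z, k_max), limit)
```

The bound decreases toward the limit but never falls below it, so the connection probability stays bounded away from zero however large $k_{\max}$ becomes.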

It remains to be shown that the diameter of the
connected component is on the order of $O(\log N)$. To see
this, we use the approximate formula $l \approx \frac{\log M}{\log d}$ [14] for
the diameter of a random graph with size $M$ and average
degree $d$. We know that the size of the percolated graph
is $\frac{zN}{k_{\max}} \langle k \rangle$ and that the average degree is approximately
2 [19]. Thus the diameter of the giant component is:

$$l \approx \frac{\log\left(\frac{zN}{k_{\max}} \langle k \rangle\right)}{\log 2}
= \frac{\log\frac{N}{k_{\max}} + \log z + \log \langle k \rangle}{\log 2} = O(\log N).$$

At this point we have presented the main result. If 
we can cache content on high degree nodes, and query 
by percolation starting from a high degree node, we will 
always find the content we are looking for. We have not 
yet addressed how each node can find a high degree node. 
In the next section we show that by taking a short ran- 
dom walk through the network we will reach a high de- 
gree node with high probability, and this gives us the 
final piece we need to make the network searchable by all 
nodes. 



C. Random Walks Reach High Degree Nodes

Consider a random PL network of size $N$ and with
maximum node degree $k_{\max}$. We want to compute the
probability that following a randomly chosen link one
arrives at a high degree node. To find this probability,
consider the generating function $G_1(x)$ [13] of the degree
of the nodes arrived at by following a random link:

$$G_1(x) = \frac{\sum_{k=2}^{k_{\max}} k^{-\tau+1} x^{k-1}}{C}, \qquad (3)$$

where $C = \sum_{k=2}^{k_{\max}} k^{-\tau+1}$. This results in the probability
of arriving at a node with degree greater than $k_{\max}/2$ to be:

$$P_\tau = \frac{1}{C} \sum_{k=k_{\max}/2}^{k_{\max}} k^{-\tau+1}.$$

Since the degrees of the nodes in the network are
independent, each step of the random walk is an independent
sample of the same trial. The probability of reaching a
high degree node within $\alpha/P_\tau$ steps is:

$$1 - (1 - P_\tau)^{\alpha/P_\tau} > 1 - e^{-\alpha}.$$

Therefore, after $O(1/P_\tau)$ steps, a high degree node will
be encountered in the random walk path with high (constant)
probability. Now we need to compute $P_\tau$ for $\tau = 2$
and $2 < \tau < 3$. Since for all decreasing, positive $f(k)$
we have $\sum_{k=a}^{b} f(k) > \int_{a}^{b+1} f(k)\,dk > \int_{a}^{b} f(k)\,dk$ and
$\sum_{k=a}^{b} f(k) < \int_{a-1}^{b} f(k)\,dk$, we can bound the following
sums.

If $\tau = 2$, the probability of arriving at a node
with degree greater than $k_{\max}/2$ satisfies

$$P_2 = \frac{1}{C} \sum_{k=k_{\max}/2}^{k_{\max}} k^{-1} > \frac{\log 2}{C},$$

and $C = \sum_{k=2}^{k_{\max}} k^{-1} < \log(k_{\max})$. We finally get:

$$P_2 > \frac{\log 2}{\log k_{\max}}. \qquad (4)$$

For $\tau = 2$, then, in $O(1/P_2) = O(\log k_{\max})$ steps we have
reached a high degree node.

If $2 < \tau < 3$, the probability of arriving at a
node with degree greater than $k_{\max}/2$ satisfies

$$P_\tau > \frac{1}{C} \cdot \frac{2^{\tau-2} - 1}{\tau - 2} \cdot \frac{1}{k_{\max}^{\tau-2}},$$

and $C = \sum_{k=2}^{k_{\max}} k^{-\tau+1} < \frac{1}{\tau-2}\left(1 - \frac{1}{k_{\max}^{\tau-2}}\right)$. We finally get:

$$P_\tau > \frac{2^{\tau-2} - 1}{k_{\max}^{\tau-2} - 1}. \qquad (5)$$

For $2 < \tau < 3$, then, in $O(1/P_\tau) = O(k_{\max}^{\tau-2})$ steps we
have reached a high degree node, which is polynomially
large in $k_{\max}$ rather than logarithmically large, as in the
case of $\tau = 2$.

A sequential random walk requires $O(k_{\max}^{\tau-2})$ time steps
to traverse $O(k_{\max}^{\tau-2})$ edges, and hence the query
implantation time will dominate the search time, making the
whole search time scale faster than $O(\log N)$. Recall that
the percolation search step will only require $O(\log N)$
time, irrespective of the value of $\tau$. A simple parallel
query implantation process can solve the problem. To
implant $k_{\max}^{\tau-2}$ query seeds, for example, a random walker
with time to live (TTL) of $K = \log_2 k_{\max}^{\tau-2}$ will initiate a
walk from the node in question, and at each step of the
walk it implants a query seed and also initiates a second
random walker with time to live $K - 1$. This process will
continue recursively until the time to live of all walkers
is exhausted. The number of links traversed by all the
walkers is easily seen to be:

$$\sum_{j=0}^{K-1} 2^j = 2^K - 1 = k_{\max}^{\tau-2} - 1.$$

Figure 3 gives simulation results to show that the parallel
walk is effective, and thus search time scales as $O(\log N)$
for all $2 < \tau < 3$. In practice, for values of $\tau$ close to
two, the quality of search is fairly insensitive to how the
number of query implants is scaled.


D. Communication Cost or Traffic Scaling

Each time we want to cache a content, we send it on
a random walk across $L(N, \tau) = O(1/P_\tau)$ edges. When
we make a query, if we reach the giant component, each
edge passes it with probability $q$ (if we don't reach a
giant component only a constant number of edges pass
the query). Thus, the total communications traffic scales
as $qE = q_c \langle k \rangle N / \gamma$. Since $q_c = \langle k \rangle / \langle k^2 \rangle$ we have
$C_\tau = O\left(\frac{\langle k \rangle^2 N}{\langle k^2 \rangle}\right)$. For all $2 \le \tau < 3$, $\langle k^2 \rangle = O(k_{\max}^{3-\tau})$. For $\tau = 2$,
$\langle k \rangle = \log k_{\max}$, which gives

$$C_2 = O\left(\frac{\log^2 k_{\max}\, N}{k_{\max}}\right). \qquad (6)$$

For $2 < \tau < 3$, $\langle k \rangle$ is constant, which gives

$$C_\tau = O\left(k_{\max}^{\tau-3} N\right). \qquad (7)$$

In Section III A we showed that the number of high
degree nodes is $H = O(N / k_{\max}^{\tau-1})$. We also know that
$L(N, \tau) = \alpha / P_\tau$, with $P_2 = \Omega(1/\log k_{\max})$ and
$P_\tau = \Omega(1/k_{\max}^{\tau-2})$. Thus we can rewrite the communication
scaling in terms of the high degree nodes,
$C_\tau = O(L(N, \tau)^2 H)$. So we see that the communication cost
scales linearly in $H$, but as the square of the length of
the walk to the high degree nodes. This agrees with our
intuition, since the high degree nodes are the nodes that
store the cache and answer the queries.

In the next section we discuss explicit scaling of $k_{\max}$
to get communication cost scaling as a function of $N$.
Tables I and II show the scaling of the cache and
communication cost in $N$. We see that for all $\tau < 3$, we have
sublinear communication cost scaling in $N$.


E. On Maximum Degree $k_{\max}$

There are two ways to generate a random PL network:
(i) Fix a $k_{\max}$ and normalize the distribution, i.e.,

$$p_k = A k^{-\tau}, \quad 0 < k \le k_{\max}, \qquad (8)$$

where

$$A^{-1} = \sum_{k=1}^{k_{\max}} k^{-\tau}. \qquad (9)$$



To construct the random PL graphs, $N$ samples are
then drawn from this distribution. For several reasons,
the choice $k_{\max} = O(N^{1/\tau})$ is recommended in the
literature, and in our scaling calculations (e.g., Table II)
we follow this upper bound.

(ii) No a priori bound on the maximum is placed, and
$N$ samples are drawn from the distribution $p_k = A k^{-\tau}$,
where $A^{-1} = \sum_{k=1}^{\infty} k^{-\tau}$. It is quite straightforward to
show that almost surely, $k_{\max} = O(N^{1/(\tau-1)})$. Thus, when
$\tau = 2$, $k_{\max} = cN$ ($1 > c > 0$) in this method of
generating a random PL graph.
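Method (i) can be sketched as a direct sampling routine. This is an illustrative sketch only: the function name, the `k_min` parameter, and the use of Python's `random.Random.choices` for weighted sampling are assumptions, and the step that wires the sampled degrees into an actual graph is omitted.

```python
import random

def sample_degrees(n, tau, k_max, rng, k_min=1):
    """Draw n degrees from p_k = A * k**(-tau) restricted to k_min <= k <= k_max."""
    ks = list(range(k_min, k_max + 1))
    weights = [k ** (-tau) for k in ks]   # unnormalized; choices() normalizes them
    return rng.choices(ks, weights=weights, k=n)

n, tau = 10_000, 2.0
k_max = int(n ** (1 / tau))               # the bounded choice k_max = N**(1/tau)
degrees = sample_degrees(n, tau, k_max, random.Random(0))
```

Method (ii) corresponds to dropping the cap on `k_max`, in which case the largest sampled degree is itself a random quantity that grows with `n`.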

A potential problem with using the larger values of
$k_{\max}$, as given by method (ii), is that the assumption
that the links are chosen independently might be violated.
Random graph assumptions can be shown to still
hold when the maximum degree of a power-law random
graph is $k_{\max} = O(N^{1/\tau})$. This however does not
necessarily mean that the scaling calculations presented
in the previous section do not hold for $k_{\max} = O(N^{1/(\tau-1)})$.
In fact, extensive large-scale simulations (see Section IV)
suggest that one can indeed get close to poly-logarithmic
scaling of traffic (i.e., $O(\log^2 N)$), as predicted by the
scaling calculations in this section.

There are several practical reasons for bounding $k_{\max}$
as well. First, in most grown random graphs, $k_{\max}$ scales
as $N^{1/\tau}$. While grown random graphs display inherent
correlations, we would like to compare our scaling
predictions with the performance of the search algorithm when
implemented on grown graphs. Hence, the scaling laws
that would be relevant for such P2P systems correspond
to the case of bounded $k_{\max}$. Second, since the high
degree nodes end up handling the bulk of the query traffic,
it might be preferable to keep the maximum degree low.
For example, for $\tau = 2$, the traffic generated is of the
same order as the maximum degree when $k_{\max} = c\sqrt{N}$,
thus providing a balance between the overall traffic and
the traffic handled by the high degree nodes individually.
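The trade-off between the two choices of maximum degree can be made concrete for $\tau = 2$ by evaluating the traffic estimate $\log^2(k_{\max})\,N/k_{\max}$ with all constants dropped. The function name, the network size, and the choice $c = 1$ are assumptions for illustration only.

```python
import math

def traffic_tau2(n, k_max):
    """Traffic estimate for tau = 2 without constants: log(k_max)**2 * n / k_max."""
    return math.log(k_max) ** 2 * n / k_max

n = 1_000_000
unbounded = traffic_tau2(n, n)                     # k_max = c * N with c = 1 (assumed)
bounded = traffic_tau2(n, int(math.sqrt(n)))       # k_max = sqrt(N)
print(f"k_max = N: {unbounded:.0f}, k_max = sqrt(N): {bounded:.0f}")
```

The bounded case costs roughly a factor of $\sqrt{N}$ more traffic (up to logarithmic factors), but no single node has to carry a degree proportional to the whole network.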





                 | Cache Size (TTL)                 | Query Cost
$\tau = 2$       | $O(\log N)$                      | $O(\log^2 N)$
$2 < \tau < 3$   | $O(N^{(\tau-2)/(\tau-1)})$       | $O(N^{(2\tau-4)/(\tau-1)})$

TABLE I: The scaling properties of the proposed algorithm
when $k_{\max} = O(N^{1/(\tau-1)})$.





                 | Cache Size (TTL)                 | Query Cost
$\tau = 2$       | $O(\log N)$                      | $O(\log^2(N)\, N^{1/2})$
$2 < \tau < 3$   | $O(N^{1-2/\tau})$                | $O(N^{2-3/\tau})$

TABLE II: The scaling properties of the proposed algorithm
when $k_{\max} = O(N^{1/\tau})$.
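The polynomial entries of both tables follow from substituting the chosen scaling of $k_{\max}$ into the walk length $O(k_{\max}^{\tau-2})$ and the query cost $O(k_{\max}^{\tau-3} N)$. The helper below, whose name and interface are hypothetical, computes the resulting exponents of $N$ for $2 < \tau < 3$.

```python
def scaling_exponents(tau, bounded):
    """Exponents of N for (cache size, query cost) when 2 < tau < 3.

    bounded=True  assumes k_max = N**(1/tau)        (as in Table II);
    bounded=False assumes k_max = N**(1/(tau - 1))  (as in Table I).
    """
    kappa = 1 / tau if bounded else 1 / (tau - 1)   # k_max = N**kappa
    cache = kappa * (tau - 2)      # walk length L = O(k_max**(tau - 2))
    cost = 1 + kappa * (tau - 3)   # query cost C = O(k_max**(tau - 3) * N)
    return cache, cost

for tau in (2.1, 2.5, 2.9):
    print(tau, scaling_exponents(tau, bounded=True), scaling_exponents(tau, bounded=False))
```

For every exponent in the open interval both numbers stay below one, which is the sublinear scaling claimed in the text; the unbounded choice of $k_{\max}$ gives the smaller query cost at the price of a larger cache.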



Hit Rate          | 50%     | 75%     | 90%     | 98%
Unique Replicas   | 1.3e-3  | 2.4e-3  | 3.2e-3  | 6.8e-3
10 Replicas       | N/A     | N/A     | 2.0e-4  | 4.7e-4

TABLE III: The fraction of edges (i.e., the ratio of the traffic
generated by the percolation search and the traffic generated
by the straightforward search where queries are relayed on
every edge) involved in a search for various hit-rates when
(i) each node has a unique content, and (ii) 10 replicas of
each content are distributed randomly in the network. The
results are for a power-law network with $\tau = 2$, $N = 30$K,
and TTL = 25 for both query and content implants (see Figs.
1 and 2).

FIG. 1: The hit-rate as a function of the fraction of links
used in search, for networks with $\tau = 2$ and $\tau = 2.3$. The number of
nodes is 30000 and the TTL is 25 for both query and content
implants. For the case of $\tau = 2$,
$k_{\max} \approx 2N^{1/2} \approx 350$, while for the PL network with $\tau = 2.3$
the maximum degree is $k_{\max} \approx 2N^{1/2.3} \approx 176$.



IV. SIMULATIONS ON RANDOM PL 
NETWORKS 



For all the simulations reported in this section, a random
power-law graph is generated with the method reported
in [13]. The minimum degree of the nodes is
enforced to be two so that any node is part of the giant
connected component with probability one (see [13]).
Note that in the simulations, TTL refers to the length of
the random walks performed for content-list replication
and query implantation. The scaling enforced (if any)
on the maximum degree ($k_{\max}$) is also reported for each
simulation.



A. Hit-rate vs. Traffic 

Fig. Hshows the hit rates achieved assuming that each 
node has a unique content. As expected, for the same 
traffic (i.e., the number of links used in the bond perco- 
lation stage of the algorithms) the hit rate for r = 2 is 
greater than that for r = 2.3. Some of the statistics for 
hit rates and corresponding traffic are listed in Table ITTll 

Fig. 13 illustrates the first-order phase transition of 
query hit-rates, as opposed to the second-order phase 
transition of the size of the largest connected component, 
as a function of the percolation probability. As noted in 
the introduction, this first-order phase transition is a key 
aspect of the proposed algorithm. 
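
The bond percolation stage can be sketched as a breadth-first broadcast in which every edge is given one independent chance, with probability q, to relay the query. This is a minimal sketch under the assumption that the graph is stored as a map from integer node ids to neighbor sets; the name `percolation_broadcast` is hypothetical and the code is not the simulator used for the figures.

```python
import random
from collections import deque

def percolation_broadcast(adj, source, q, rng=None):
    """Relay a query from `source`, forwarding over each edge with probability q.

    Returns (visited, edges_used): the set of nodes the query reached and the
    number of edges on which it was actually relayed (the traffic).
    """
    rng = rng or random.Random()
    visited = {source}
    tried = set()                 # each undirected edge gets a single coin flip
    edges_used = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            edge = (u, v) if u < v else (v, u)
            if edge in tried:
                continue
            tried.add(edge)
            if rng.random() < q:
                edges_used += 1
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
    return visited, edges_used

if __name__ == "__main__":
    ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
    reached, traffic = percolation_broadcast(ring, 0, 0.5, random.Random(3))
    print(len(reached), traffic)
```

Sweeping q and recording both the hit rate (whether a node caching the requested content is in `visited`) and `edges_used` is one way to reproduce the qualitative shape of the two curves being compared.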

Fig. 3 shows the performance of the search algorithm
when the query-implantation step for the case of r > 2
is executed in parallel vs. when it is executed serially.
Recall that for r > 2 the number of independent queries
required to ensure that one of the implanted queries is
on a node that is part of the giant connected component

FIG. 2: The hit-rate, fraction of links, and fraction of nodes
used in the search as a function of the percolation probability,
plotted together for comparison. While there is a sudden
jump in the hit-rate just above the percolation threshold (an
indication of a first-order transition), the number of links
and nodes participating in the search increases much more
gracefully (an indication of a second-order transition, also
manifested in the linear growth of these parameters just above
the percolation threshold). Here r = 2 and the number of nodes
is 30000, with kmax = 400.



scales faster than O(log N). Since the query implantation
time, if the implantations were carried out by a serial
random walk, would dominate the desired search time of
O(log N), we introduced a parallel query implantation
process (branching random walk), where the walker constructs
a binary tree such that the total number of nodes
in the tree is the number of required query implantations.
As shown in Fig. 3, the performance of the branching random
walk is as good as that of a serial random walk.
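
One way to picture the branching random walk is the sketch below, in which every active walker forks into two walkers per round until the required number of implants has been placed. The single-tree layout, the uniform choice of neighbors, and the name `branching_walk` are assumptions made for illustration; the text above only specifies that the walker builds a binary tree whose size equals the number of required implants.

```python
import random

def branching_walk(adj, start, n_implants, rng=None):
    """Place n_implants query copies using a binary branching random walk.

    Returns (implants, rounds): the visited positions (tree nodes, which may
    repeat graph nodes) and the number of parallel rounds that were needed.
    """
    rng = rng or random.Random()
    implants = [start]
    frontier = [start]
    rounds = 0
    while len(implants) < n_implants and frontier:
        next_frontier = []
        for u in frontier:
            for _ in range(2):                    # each walker forks into two
                if len(implants) == n_implants:
                    break
                v = rng.choice(sorted(adj[u]))    # uniform random neighbor
                implants.append(v)
                next_frontier.append(v)
        frontier = next_frontier
        rounds += 1
    return implants, rounds

if __name__ == "__main__":
    ring = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
    implants, rounds = branching_walk(ring, 0, 16, random.Random(2))
    print(len(implants), rounds)
```

With 16 implants this single tree finishes in 4 parallel rounds, whereas a serial walk placing the same number of implants needs 15 steps; the number of rounds grows only logarithmically in the number of implants.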

FIG. 3: Comparison of the hit-rate for queries implanted in
parallel (circles) and serially (squares). In each case the total
number of queries is 16. The parallel implant uses four
branching random walks, each of size 4, and hence the total
implantation time is 8, while the serial implantation is a
simple random walk of size 16 and takes 16 time units. The
network has r = 2.3 and N = 10000.



FIG. 4: The fraction of contents found as a function of the
number of times the search was repeated. Suppose a fraction r
of contents were not found at the first try. If successive queries
were independent, the fraction of contents found after the K-th
try should be around $1 - r^K$. The actual fraction is plotted along
with what one expects from random tries. The network has
size N = 30,000 and power-law exponent 2. TTLs are deliberately
chosen to be very low (~5), so that r is large (> 70%).



B. Repeated Trials 



The results of Section III guarantee that every content
will be found with probability approaching one, as
long as the content and query implantation steps are long
enough. However, in practice, at any percolation probability
we will get a hit rate that is < 1, and the issue
is what the behavior of the search algorithm would be
if one repeated the query a few times. If each search is
independent of the others, then we expect the hit rate
to behave as $1 - (1 - p)^r$, where p is the hit rate for a
single attempt and r is the number of attempts. Fig. 4
shows simulation results verifying that each query attempt
is almost independent of the others. The
fact that the hit rate can be increased by repeated trials
is very important from an implementation perspective:
one does not need to know the percolation threshold and
the exact scaling of TTLs in order to obtain very high
hit rates. As shown in our simulations (Fig. 4), even
if we start with only a 30% hit rate, the hit rate can be
increased to almost 90% in only seven attempts.
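
The arithmetic behind the independent-trials estimate can be checked with a few lines. The helper names `hit_rate_after` and `attempts_needed` are hypothetical, and the calculation assumes perfectly independent attempts, which the simulations only approximately confirm.

```python
def hit_rate_after(p, attempts):
    """Combined hit rate after `attempts` independent tries, each succeeding with probability p."""
    return 1.0 - (1.0 - p) ** attempts

def attempts_needed(p, target):
    """Smallest number of independent attempts whose combined hit rate reaches `target`.

    Assumes 0 < p <= 1 and target < 1, otherwise the loop would not terminate.
    """
    attempts = 1
    while hit_rate_after(p, attempts) < target:
        attempts += 1
    return attempts

if __name__ == "__main__":
    print(round(hit_rate_after(0.3, 7), 4))   # about 0.9176
    print(attempts_needed(0.3, 0.9))          # 7
```

Under this idealization, a 30% single-attempt hit rate reaches roughly 92% after seven attempts.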

FIG. 5: The hit-rate as a function of the fraction of links used
in search, for r = 2, for the case when 10 copies of each content
are in the network, along with the case of unique contents for
comparison. The number of nodes is 30000, kmax ~ 375,
the average degree is 6, and the TTL is 25 for both query and
content implants.



C. Content Replication 



Next we consider another relevant issue: what would be
the improvement in performance if multiple nodes in the
network had the same content? As part of the percolation
search algorithm, we already execute a caching or
content implantation step that makes sure that a subsequent
query step would find any content with probability
approaching one. Now, if l nodes share the same content,
then it will be implanted via l different independent
random walks. In the case of random PL graphs, the l
different random walks for content implantation are equivalent
to looking for a content l times independently (i.e.,
performing the query implantation and bond percolation
steps l times), while performing the content implantation
random walk only once. Hence, in our percolation search
algorithm, content replication (due to nodes having the
same content) improves the hit rate exponentially closer
to 1. The hit rates for unique vs. 10 copies of contents
are shown in Fig. 5.

FIG. 6: Scaling behavior of the percolation probability required
for a fixed hit-rate of 95% as a function of the network
size, for networks with r = 2 and r = 2.3. The TTL is increased
according to Table I and the maximum degree is forced to be
$N^{1/r}$. The scaling predictions according to Table II are
also shown for comparison.

FIG. 7: Scaling behavior of the fraction of links required for
a fixed hit-rate of 90% as a function of the network size, for
a network with r = 2. The TTL is increased according to
Table I and the maximum degree is forced to be $2\sqrt{N}$. The
scaling is slightly improved to $O(N^{-1/2})$ for the fraction of
links required.




D. Traffic Scaling 

Fig. 6 shows actual scalings observed in our simulations
for various choices of r and kmax. The predicted
scaling laws provide a good fit for the observed data when
kmax is chosen to be $O(N^{1/r})$. The scaling for the percolation
probability required for a high hit rate matches
those predicted for the traffic reported in Table II. On
the other hand, while Fig. 6 shows the scaling of the
percolation probability necessary to obtain a given target
hit-rate, the actual number of links traversed is in fact
even less. If each and every link had the chance to be
traversed with the percolation probability, then the actual
traffic would directly correspond to the percolation
probability. A broadcast started from a query implant,
however, might end up at dead-end nodes close to this implant.
That results in the actual scaling of the traffic being
slightly better than the scaling of the required percolation
probability. For r = 2, for example, the $O(\log^2 N/\sqrt{N})$
scaling verified in Fig. 6 is improved to $O(1/\sqrt{N})$,
as experimentally verified in Fig. 7.
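
To see how different the two r = 2 scalings look on a log-log plot, the sketch below compares the local slopes of $\log^2 N/\sqrt{N}$ and $1/\sqrt{N}$. All constant prefactors are set to one, which is an assumption for illustration only (the prefactors in the simulations are not given here), and the function names are hypothetical.

```python
import math

def probability_scaling(n):
    """log^2(N) / sqrt(N), with the unknown prefactor set to 1 (assumption)."""
    return math.log(n) ** 2 / math.sqrt(n)

def traffic_scaling(n):
    """1 / sqrt(N), with the unknown prefactor set to 1 (assumption)."""
    return 1.0 / math.sqrt(n)

def loglog_slope(f, n1, n2):
    """Slope of f between n1 and n2 on a log-log plot."""
    return (math.log(f(n2)) - math.log(f(n1))) / (math.log(n2) - math.log(n1))

if __name__ == "__main__":
    for f in (probability_scaling, traffic_scaling):
        print(f.__name__, round(loglog_slope(f, 1e4, 1e5), 3))
```

Over the range of network sizes typical of such simulations, the logarithmic factor makes the first curve noticeably shallower than the pure power law with slope -1/2, which is why the two can be told apart in log-log fits.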

More significantly, even when kmax scales faster than 



FIG. 8: Scaling behavior of the percolation probability re- 
quired for a fixed hit-rate of 95% as a function of the network 
size for a PL network with r = 2, on a log-log basis. The 
TTL is increased according to Table I. The maximum degree,
however is forced to be 2N^''^. The predicted scaling is also 
depicted for comparison. 



$N^{1/r}$, the same theoretical scaling laws seem to hold. As
an example of how well the traffic scaling laws hold, we have
provided simulations for r = 2 with a maximum degree that
grows faster than $N^{1/2}$ (Fig. 8) and with kmax ~ N (Fig. 9).



V. MAKING UNSTRUCTURED P2P 
NETWORKS SCALABLE 

As noted in the introduction, a number of schemes 
have been proposed to address the scaling problem in 
unstructured P2P networks, and the following are a few 
of the more important ones: 

1. Ultra-peer Structures and Cluster-Based Designs: 
A non-uniform architecture with an explicit hierarchy 

FIG. 9: The scaling of the percolation probability required for
a hit rate of 95%, when kmax = N/4, r = 2, and TTL = 25.
The scaling for log'^ ]S[M~'^'^^ is also depicted for comparison. 
It is important to note that simulations for such large values 
of kmax are fraught with difficulties. This simulation however 
confirms the fact that while the scaling results are precise 
when kmax ~ $O(N^{1/2})$, they still closely match the simulations
even in the extreme case of kmax = O(N).



seems to be the quickest fix. This structure was also 
motivated by the fact that the nodes in the network are 
not homogeneous; a very large fraction of the nodes have
small capacity (e.g., dial-up modems) and a small fraction
have virtually unbounded capacity. The idea is to
assign a large number of low-capacity nodes to one or
more Ultra-peers. The Ultra-peer knows the contents of
its leaf nodes and sends them only the relevant queries.
Among themselves, the Ultra-peers perform the usual broadcast
search, where each query is passed on every edge.

The Ultra-peer solution helps shield low bandwidth 
users; however, the design is non-uniform, and an explicit 
hierarchy is imposed on the nodes. In fact, the two-level 
hierarchy is not scalable in the strict sense. After more 
growth of the network, the same problem will start to ap- 
pear among the Ultra-peers, and the protocol should be 
augmented to accommodate a third level in the hierarchy, 
and so on. In a more strict theoretical sense, the traffic 
still scales linearly, but is always a constant factor (de- 
termined by the average number of nodes per ultra-peer) 
less than the original Gnutella system. Cluster-based designs
[11] are more centralized versions of practically the
same idea, and therefore suffer from the same issues.

Note that the percolation search algorithm naturally 
distills an Ultra-peer-like subnetwork (i.e., the giant con- 
nected component that remains after the bond percola- 
tion step) , and no external hierarchy needs to be imposed 
explicitly. Moreover, we show in Section VI that even if
the random graph's degree distribution is a mixture of
two different distributions (e.g., a heavy-tailed PL with
r ≈ 2 and a light-tailed PL with r > 4), the percolation
search algorithm naturally shields the category of
nodes with light-tailed degree distribution, and most of
the traffic is handled by the nodes with heavy-tailed degree
distributions.

2. Random Walk Searches with Content Replication: 
Lv et al. [13] analyze random walk searches with content
replication, and their strategy is close to the work of
Adamic et al., which was reviewed in an earlier section. The
idea is very simple: for each query, a random walker
starts from the initiator and asks the nodes on the way
for the content until it finds a match. If there are enough
replicas of every content on the network, each query
would be successfully answered after a few steps. In
[13] it is assumed that a fraction $\lambda_i$ of all nodes have
the content i. They consider the case where $\lambda_i$ might
depend on the probability ($q_i$) of requesting content i.
They show that, under their assumptions, performance is
optimal when $\lambda_i \propto \sqrt{q_i}$.

This scheme has several disadvantages. Since high con- 
nectivity nodes have more incoming edges, random walks 
gravitate towards high connectivity nodes. A rare item 
on a low connectivity node will almost never be found. 
To mitigate these problems, [13] suggests avoiding high
degree nodes in the topology.

Moreover, this scheme is not scalable in a strict sense
either: even with the uniform caching assumption satisfied,
the design requires O(N) replications per content,
and thus, assuming that each node has a unique content,
it will require a total of $O(N^2)$ replications and an
average O(N) cache size. The above scaling differs only by
a constant factor from the straightforward scheme of all 
nodes caching all files. Finally, it is a serial search algo- 
rithm, thus compromising the speed of query resolution. 

Clearly, the percolation search algorithm has several
advantages over this scheme, and they are almost identical
to the ones stated in the earlier section where the percolation
search and the random-walk based searches were
compared. Moreover, the percolation search algorithm
finds any content, even if only one node in the network 
has it, while the above algorithm relies on the fact that 
a constant fraction of the nodes must have a content, in 
order to make the search efficient. 



A. Percolation Search on Limewire Crawls

We next address the issue of how well the percolation
search algorithm would work on existing P2P networks.
For our simulations we have used a number of
such snapshots taken by Limewire |lfil |. In particular, 
we have used snapshots number 1, 3, and 5 from [20] with
N = 6AK, UK, 30K respectively. 

There are two important features about these snap- 
shot networks that are relevant to our discussions: 
(i) Because of how one crawls the network, the resulting
snapshot subnetworks are inherently networks obtained
after bond percolation, where the percolation probability
is high but not unity. The scaling laws of the percolation
search algorithm suggest that the performance of the
search on the actual graphs should be even better than that
reported here.

Hit Rate             | 50%    | 75%    | 90%    | 98%
Unique Replicas      | 3.1e-3 | 7.1e-3 | 1.3e-2 | 2.8e-2
10 Replicas, 2 tries | 1.1e-3 | 1.3e-3 | 2.5e-3 | 6.3e-3
10 Replicas, 1 try   | 1.3e-3 | 2.3e-3 | 2.5e-2 | 4.6e-2

TABLE IV: For the Limewire crawl #5: the fraction of original
Gnutella traffic required for various hit-rates when all
contents are unique as well as the case where 10 replicas of
each content are in the network. The case of two tries with
10 replicas is also quoted.

FIG. 10: The hit-rate as a function of the fraction of links used
in search, for Limewire crawls #5, #3, and #1. The ratio of the
variance to the mean for each crawl is indicated. The
performance of the percolation search algorithm is seen to
depend on this ratio: the higher the variance-to-mean ratio, the
better the performance of the percolation search algorithm.
The TTL used for both query and content implants is
25 in all cases.



(ii) The degree distributions of these networks are not
ideal power-laws, and at best they can be categorized
as heavy-tailed degree distributions. A good measure of
a heavy-tailed degree distribution is the ratio of the variance
and the mean. In PL networks with heavy tails, i.e.,
2 < r < 3, this ratio is unbounded and goes to infinity as
the network size grows. However, one does not need these
ideal conditions for the percolation search algorithm to
provide substantial reduction in traffic (see Figs. 10 and
11). Recall that the search traffic generated in the percolation
search algorithm is approximately $\langle k \rangle N q_c$, and
hence is directly proportional to $\langle k \rangle q_c$. We further know
that $\langle k \rangle q_c \sim \langle k \rangle^2 / \langle k^2 \rangle$. Thus, as long as the graph has a high
root-mean-squared (RMS) to mean ratio, we expect the
percolation search algorithm to show substantial gains.
This is indeed the case in the implementations of our algorithm
in the crawl networks. Table IV shows that the
overall traffic can be reduced by 2 to 3 orders of magnitude
without compromising the performance.
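
The quantity $\langle k \rangle^2 / \langle k^2 \rangle$ is easy to evaluate for any measured degree sequence, which gives a quick indication of how much a given topology stands to gain. The sketch below does this for two toy degree sequences; both sequences and the function names are hypothetical, and the threshold estimate $q_c \approx \langle k \rangle / \langle k^2 \rangle$ is the large-variance approximation used in the text, not an exact threshold.

```python
def degree_moments(degrees):
    """Return (<k>, <k^2>) for a degree sequence."""
    n = len(degrees)
    mean = sum(degrees) / n
    second = sum(d * d for d in degrees) / n
    return mean, second

def threshold_estimate(degrees):
    """q_c ~ <k> / <k^2>, reasonable only when <k^2> is much larger than <k>."""
    mean, second = degree_moments(degrees)
    return mean / second

def traffic_factor(degrees):
    """<k> q_c ~ <k>^2 / <k^2>; smaller values indicate larger traffic savings."""
    mean, second = degree_moments(degrees)
    return mean * mean / second

if __name__ == "__main__":
    regular = [3] * 100                # no degree variance at all
    hub_heavy = [2] * 99 + [200]       # one hub dominates the second moment
    print(traffic_factor(regular))     # 1.0
    print(round(traffic_factor(hub_heavy), 4))
```

A degree sequence with no variance gives a factor of one, i.e., no gain, while even a single large hub among low-degree nodes pushes the factor down by more than an order of magnitude.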



FIG. 11: For the Limewire crawl #5: the hit-rate as a function
of the fraction of the links required, when all contents are
unique as well as the case where 10 replicas of each content
are in the network. The case of two tries with 10 replicas is
also shown for comparison.



VI. PERCOLATION SEARCH IN 
HETEROGENEOUS PL RANDOM GRAPHS 

So far, we have assumed a uni-modal heavy tailed dis- 
tribution for the networks on which percolation search 
is to be performed. In reality, however, most networks 
are heterogeneous, consisting of categories of nodes with 
similar capabilities or willingness to participate in the 
search process; e.g., the dominant categories in existing
P2P networks are modems, DSL subscribers, and those
connected via high-speed T-1 connections. Thus, the degree
distribution in a real network is expected to be a
mixture of heavy-tailed (for nodes with high capacity)
and light-tailed (for nodes with lower capacity) distributions.
We now show that the superior performance
of the percolation search algorithm is not limited to the
case of a uni-modal power-law random graph. In fact,
as discussed before, the percolation search performs well
as long as the variance of the degree distribution is much 
larger than its mean. 

Consider, as an example, the case of a bi-modal network,
where a fraction x of the nodes have degree distribution
$P_k$ with a heavy tail, while the rest have a light-tailed
degree distribution $Q_k$. Assume, for the sake of simplicity,
that the average degree of the two categories of nodes is
the same. The percolation threshold $q_c^*$ of
this graph is then related to $q_c$, the percolation threshold
of a graph with the same degree distribution as $P_k$,
as $q_c^* \approx q_c/x$. Therefore, as long as a good fraction of
all the nodes have a heavy tail, all observations of this
paper still hold for a heterogeneous network. As far as
the overall traffic is concerned, the total number of links
traversed is at most $(xN)\,q_c^* = N q_c$, the same as in the
case where all nodes had a heavy-tailed distribution $P_k$.
The query and content implantation times are, however, a
bit longer in this case.
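
The relation between the two thresholds can be checked numerically from the moments of the mixture. The sketch below uses the large-variance approximation $q_c \approx \langle k \rangle / \langle k^2 \rangle$ and entirely hypothetical moment values (average degree 6, second moments 5000 and 40); only the mixing fraction x = 4000/24000 is borrowed from the node counts of the simulated two-mode network.

```python
def threshold(mean, second_moment):
    """Approximate percolation threshold <k> / <k^2> (large-variance regime)."""
    return mean / second_moment

def mixture_threshold(x, mean, heavy_second, light_second):
    """Threshold of a graph where a fraction x of nodes is heavy-tailed.

    Both groups are assumed to share the same mean degree, so only the
    second moment of the mixture changes.
    """
    mixed_second = x * heavy_second + (1.0 - x) * light_second
    return threshold(mean, mixed_second)

if __name__ == "__main__":
    x = 4000 / 24000
    mean, heavy_second, light_second = 6.0, 5000.0, 40.0   # illustrative values
    q_c = threshold(mean, heavy_second)
    q_star = mixture_threshold(x, mean, heavy_second, light_second)
    print(round(q_star, 5), round(q_c / x, 5))
```

Because the light-tailed group only adds to the mixed second moment, the mixture threshold never exceeds $q_c/x$, and the two agree closely whenever the heavy-tailed second moment dominates.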

Percolation search on heterogeneous networks, on the
other hand, naturally provides traffic shielding to low-capability
nodes. Consider again a network with, say, two
categories of nodes. The percolation search works by
cutting out many links of the network, and therefore almost
all nodes participating in the search process are the
ones that are highly connected, which are almost surely
part of the heavy-tailed group. For instance, if the light-tailed
group has an exponential degree distribution, then the
probability of any node of the light-tailed category participating
in the search process is exponentially small.
Naturally then, the nodes of the light-tailed category are
exempted from participation in the search process. See
the following table for a typical simulation result.



heavy tailed | light tailed | overall
3.50e-2      | 2.22e-5      | 6.12e-3



TABLE V: The fraction of nodes that participated in a search 
for a hit rate of 98%, in a network consisting of two power- 
law modes: 4000 nodes (called the heavy tailed mode) have 
a power-law exponent r = 2 while 20000 others (called the 
light tailed mode) have an exponent r = 4. TTL of 20 was 
used for both query and content implants. 



VII. CONCLUDING REMARKS 

We have presented a scalable search algorithm that
uses random-walks and bond percolation on random
graphs with heavy-tailed degree distributions to provide
access to any content on any node with probability one.
While the concepts involved in the design of our search 
algorithm have deep theoretical underpinnings, any im- 
plementation of it is very straightforward. Our exten- 
sive simulation results using both random PL networks 
and Gnutella crawl networks show that unstructured P2P 
networks can indeed be made scalable. 

Moreover, our studies show that even in networks with 
different categories of nodes (i.e., graphs where the de- 
gree distribution is a mixture of heavy-tailed and light- 
tailed distributions) the search algorithm exhibits the fa- 
vorable scaling features, while shielding the nodes with 
light-tailed degree distribution from the query-generated 
traffic. Our recent results [21] indicate that it is indeed
possible to have local rules that will enforce a desired
category of the nodes in the network to have either a 
heavy or light tailed degree distribution. One can thus 
make sure that the subgraph consisting of the nodes with 
low capacity has a light tail, and is thus exempted from 
the search traffic with high probability. On the other 
hand, the high capability nodes evolve into a subgraph 
with a heavy tail degree distribution and hence will carry 
the majority of the search load. 

Together with the new algorithms for building heavy-tailed
growing graphs, even in the presence of extreme
unreliability of the nodes and a heterogeneous set of
nodes (in terms of connectivity and bandwidth capacities),
the percolation search algorithm can provide an
end-to-end solution for constructing a large-scale, highly
scalable, and fault-tolerant distributed P2P networking
system.